Feature selection algorithms for CGH data
Abstract
Comparative Genomic Hybridization (CGH), when combined with microarray technology, measures copy number alterations (gains or losses of DNA segments) across a large number of genomic intervals. Selecting a small number of discriminative intervals from these genomic intervals, a problem known as feature selection, is important for accurate classification of CGH data. An important aspect of CGH data is that consecutive intervals are highly correlated. This paper considers the problem of feature selection for multiclass CGH data. Our approach has two phases. The first phase reduces the number of features by using Support Vector Machines: K classifiers are built, one to learn each of the K different cancer types, and we develop a novel approach that uses the generated classifiers to find the important intervals. The second phase further reduces the features found in the first phase by using a novel dynamic programming algorithm. This algorithm removes redundant proximate features by exploiting the fact that consecutive intervals in CGH data are highly correlated. We also propose a systematic way to select CGH samples from a large CGH dataset at predefined similarities and sizes. These samples can serve as benchmarks for research on CGH data. The experimental results on these datasets demonstrate that our approach can significantly reduce the number of features without reducing the predictive accuracy of classification.
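The abstract does not spell out the dynamic programming algorithm of the second phase, so the following is only a minimal sketch of one way such a pruning step could look, not the authors' method. It assumes each feature kept by the first phase has a genomic interval index and a relevance score (for example, the magnitude of an SVM weight), and that two kept features count as redundant when they lie closer than a chosen distance. The function name `prune_proximate` and the parameter `min_gap` are hypothetical.

```python
from bisect import bisect_right


def prune_proximate(positions, scores, min_gap):
    """Keep the highest-scoring set of features that are at least min_gap apart.

    positions: interval indices of the candidate features (assumed input)
    scores:    non-negative relevance score per feature (assumed input)
    min_gap:   hypothetical minimum distance between two kept features
    """
    # Work on the features in genomic order.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    pos = [positions[i] for i in order]
    sc = [scores[i] for i in order]
    n = len(pos)

    best = [0.0] * (n + 1)    # best[k]: best total score using the first k features
    take = [False] * (n + 1)  # take[k]: whether feature k-1 is kept in that optimum
    prev = [0] * (n + 1)      # prev[k]: prefix length to continue from if kept

    for k in range(1, n + 1):
        # Count earlier features that are at least min_gap to the left.
        p = bisect_right(pos, pos[k - 1] - min_gap, 0, k - 1)
        with_k = sc[k - 1] + best[p]
        if with_k > best[k - 1]:
            best[k], take[k], prev[k] = with_k, True, p
        else:
            best[k] = best[k - 1]

    # Walk back through the table to recover the kept features.
    kept = []
    k = n
    while k > 0:
        if take[k]:
            kept.append(pos[k - 1])
            k = prev[k]
        else:
            k -= 1
    return sorted(kept)


if __name__ == "__main__":
    # Three clusters of neighbouring intervals; one survivor per cluster.
    print(prune_proximate([10, 11, 12, 40, 41, 90],
                          [0.9, 0.8, 0.7, 0.5, 0.6, 0.4],
                          min_gap=5))
```

In this toy input the intervals 10, 11, 12 and 40, 41 form two tight clusters, so the sketch keeps one interval per cluster plus the isolated interval 90. A real implementation would likely measure redundancy from the data (for instance, correlation between neighbouring intervals) rather than from a fixed distance.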
Similar resources
Classification and feature selection algorithms for multi-class CGH data
Recurrent chromosomal alterations provide cytological and molecular positions for the diagnosis and prognosis of cancer. Comparative genomic hybridization (CGH) has been useful in understanding these alterations in cancerous cells. CGH datasets consist of samples that are represented by large dimensional arrays of intervals. Each sample consists of long runs of intervals with losses ...
Sequential and Mixed Genetic Algorithm and Learning Automata (SGALA, MGALA) for Feature Selection in QSAR
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been addressed using meta-heuristic algorithms such as GA, PSO, ACO, and SA. In this work, two novel hybrid meta-heuristic algorithms, Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), which are based on genetic algorithms and learning automata for QSAR f...
IFSB-ReliefF: A New Instance and Feature Selection Algorithm Based on ReliefF
The growing use of the Internet and phenomena such as sensor networks has led to an unnecessary increase in the volume of information. Though it has many benefits, it causes problems such as the need for storage space and better processors, as well as for data refinement to remove unnecessary data. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomp...
A New Hybrid Method for Improving the Performance of Myocardial Infarction Prediction
Introduction: Myocardial infarction, also known as heart attack, normally occurs due to causes such as smoking, family history, and diabetes. It is recognized as one of the leading causes of death in the world. Therefore, the present study aimed to evaluate the performance of classification models in predicting myocardial infarction, using a feature selection method tha...